BME 590 Homework 3


In this homework, the objectives are to

  1. Use R to preprocess and examine a biomedical dataset

  2. Implement multiple unsupervised learning methods in a real-world biomedical problem, including: Principal Component Analysis, Hierarchical Clustering, and K-means Clustering in R

  3. Visualize and understand principal components, hierarchical clustering dendrograms, and the outputs of K-means clustering in R

Assignments will only be accepted in electronic format in RMarkdown (.rmd) files and knitted .html files. You must submit both the RMD file and HTML file for each HW. 5 points will be deducted for every assignment submission that does not include either the RMarkdown file or the knitted html file. Your code should be adequately commented to clearly explain the steps you used to produce the analyses. RMarkdown homework files should be uploaded to Sakai with the naming convention date_lastname_firstname_HW[X].Rmd. For example, my first homework assignment would be named 20200911_Dunn_Jessilyn_HW1.Rmd. It is important to note that 5 points will be deducted for every assignment that is named improperly. Please add your answer to each question directly after the question prompt in the homework .Rmd file template provided below.

library(tidyverse)
library(ggplot2)
library(lubridate)
library(patchwork)
library(gridExtra)
library(psych)
library(corrplot)
library(ggfortify)
library(factoextra)

Data Preparation

  1. Download the cancer data titled "HW3_data.csv" from Sakai and import it into R. Look at the first 10 lines of the data to learn about the dataset. The “diagnosis” field shows whether the patient was diagnosed with a benign or malignant tumor. Please read additional information about each column available in the file "HW3_data_info.html", which can also be downloaded from Sakai.
data<-data.frame(read_csv(file = "HW3_data.csv"))
## Parsed with column specification:
## cols(
##   .default = col_double(),
##   diagnosis = col_character()
## )
## See spec(...) for full column specifications.
head(data, 10)
##          id diagnosis radius_mean texture_mean perimeter_mean area_mean
## 1    842302         M       17.99        10.38         122.80    1001.0
## 2    842517         M       20.57        17.77         132.90    1326.0
## 3  84300903         M       19.69        21.25         130.00    1203.0
## 4  84348301         M       11.42        20.38          77.58     386.1
## 5  84358402         M       20.29        14.34         135.10    1297.0
## 6    843786         M       12.45        15.70          82.57     477.1
## 7    844359         M       18.25        19.98         119.60    1040.0
## 8  84458202         M       13.71        20.83          90.20     577.9
## 9    844981         M       13.00        21.82          87.50     519.8
## 10 84501001         M       12.46        24.04          83.97     475.9
##    smoothness_mean compactness_mean concavity_mean concave.points_mean
## 1          0.11840          0.27760        0.30010             0.14710
## 2          0.08474          0.07864        0.08690             0.07017
## 3          0.10960          0.15990        0.19740             0.12790
## 4          0.14250          0.28390        0.24140             0.10520
## 5          0.10030          0.13280        0.19800             0.10430
## 6          0.12780          0.17000        0.15780             0.08089
## 7          0.09463          0.10900        0.11270             0.07400
## 8          0.11890          0.16450        0.09366             0.05985
## 9          0.12730          0.19320        0.18590             0.09353
## 10         0.11860          0.23960        0.22730             0.08543
##    symmetry_mean fractal_dimension_mean radius_se texture_se perimeter_se
## 1         0.2419                0.07871    1.0950     0.9053        8.589
## 2         0.1812                0.05667    0.5435     0.7339        3.398
## 3         0.2069                0.05999    0.7456     0.7869        4.585
## 4         0.2597                0.09744    0.4956     1.1560        3.445
## 5         0.1809                0.05883    0.7572     0.7813        5.438
## 6         0.2087                0.07613    0.3345     0.8902        2.217
## 7         0.1794                0.05742    0.4467     0.7732        3.180
## 8         0.2196                0.07451    0.5835     1.3770        3.856
## 9         0.2350                0.07389    0.3063     1.0020        2.406
## 10        0.2030                0.08243    0.2976     1.5990        2.039
##    area_se smoothness_se compactness_se concavity_se concave.points_se
## 1   153.40      0.006399        0.04904      0.05373           0.01587
## 2    74.08      0.005225        0.01308      0.01860           0.01340
## 3    94.03      0.006150        0.04006      0.03832           0.02058
## 4    27.23      0.009110        0.07458      0.05661           0.01867
## 5    94.44      0.011490        0.02461      0.05688           0.01885
## 6    27.19      0.007510        0.03345      0.03672           0.01137
## 7    53.91      0.004314        0.01382      0.02254           0.01039
## 8    50.96      0.008805        0.03029      0.02488           0.01448
## 9    24.32      0.005731        0.03502      0.03553           0.01226
## 10   23.94      0.007149        0.07217      0.07743           0.01432
##    symmetry_se fractal_dimension_se radius_worst texture_worst perimeter_worst
## 1      0.03003             0.006193        25.38         17.33          184.60
## 2      0.01389             0.003532        24.99         23.41          158.80
## 3      0.02250             0.004571        23.57         25.53          152.50
## 4      0.05963             0.009208        14.91         26.50           98.87
## 5      0.01756             0.005115        22.54         16.67          152.20
## 6      0.02165             0.005082        15.47         23.75          103.40
## 7      0.01369             0.002179        22.88         27.66          153.20
## 8      0.01486             0.005412        17.06         28.14          110.60
## 9      0.02143             0.003749        15.49         30.73          106.20
## 10     0.01789             0.010080        15.09         40.68           97.65
##    area_worst smoothness_worst compactness_worst concavity_worst
## 1      2019.0           0.1622            0.6656          0.7119
## 2      1956.0           0.1238            0.1866          0.2416
## 3      1709.0           0.1444            0.4245          0.4504
## 4       567.7           0.2098            0.8663          0.6869
## 5      1575.0           0.1374            0.2050          0.4000
## 6       741.6           0.1791            0.5249          0.5355
## 7      1606.0           0.1442            0.2576          0.3784
## 8       897.0           0.1654            0.3682          0.2678
## 9       739.3           0.1703            0.5401          0.5390
## 10      711.4           0.1853            1.0580          1.1050
##    concave.points_worst symmetry_worst fractal_dimension_worst age
## 1                0.2654         0.4601                 0.11890  NA
## 2                0.1860         0.2750                 0.08902  NA
## 3                0.2430         0.3613                 0.08758  NA
## 4                0.2575         0.6638                 0.17300  NA
## 5                0.1625         0.2364                 0.07678  NA
## 6                0.1741         0.3985                 0.12440  NA
## 7                0.1932         0.3063                 0.08368  28
## 8                0.1556         0.3196                 0.11510  NA
## 9                0.2060         0.4378                 0.10720  NA
## 10               0.2210         0.4366                 0.20750  NA
  2. Answer the following questions by using the summary function or other methods of your choice:

  a. How many observations are there?
nrow(data)
## [1] 569

Answer a. There are 569 observations in the dataset.

  b. How many independent variables are there?

total_independent_variables = length(names(data)) - 2
# Excluding the id and diagnosis columns
print(total_independent_variables)
## [1] 31

Answer b. There are 31 independent variables (excluding the diagnosis and patient id columns).

  c. Which variables have missing values and how many values were missing in each?

var_name <- names(data)  # column names of the data frame
logic_missing=list()
for(i in var_name){
  if(NA %in% data[[i]])
    logic_missing<-append(logic_missing, i)
}

print(logic_missing)
## [[1]]
## [1] "texture_mean"
## 
## [[2]]
## [1] "age"
for (a in logic_missing) {
  print(a)
  print(sum(is.na(data[[a]])))
}
## [1] "texture_mean"
## [1] 5
## [1] "age"
## [1] 542

Answer c. Missing values occur in the texture_mean and age variables: texture_mean has 5 missing values and age has 542.
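As a cross-check, the per-column NA counts can also be obtained in a single call with colSums(); a minimal sketch on a toy data frame (on the homework data, substitute `data` for `toy`):

```r
# Count NAs per column in one call; toy stands in for the homework data.
toy <- data.frame(x = c(1, NA, 3), y = c(NA, NA, 6), z = 1:3)
na_counts <- colSums(is.na(toy))
na_counts[na_counts > 0]  # keep only the columns that actually contain NAs
```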

  d. How many observations are there with a malignant diagnosis and how many are there with a benign diagnosis?
summary(as.factor(data$diagnosis))
##   B   M 
## 357 212

Answer d. There are 357 benign and 212 malignant tumors, i.e., 357 patients were diagnosed with a benign tumor and 212 with a malignant tumor. For this question, please type your answers clearly outside of R chunks, and do not just show the result of running your code.

  3. There is one column with a very large number of missing observations. Which one is it? We are going to drop this column because it does not make sense to impute observations in it. Briefly explain why we should not impute values in this column. Additionally, change the "id" column into the index column (i.e., turn the ID values into row names) and delete the "id" column. Use str() to display the resulting dataframe.
for (a in logic_missing) {
  print(a)
  print(sum(is.na(data[[a]])))
}
## [1] "texture_mean"
## [1] 5
## [1] "age"
## [1] 542

Answer 3. The age variable has the largest number of missing values (542), i.e., about 95% of its entries are missing. We should not impute this column: complete-case analysis would discard most of the rows, and with only 27 observed values there is not enough data in the age column to estimate a distribution (e.g., a reliable mean) from which to draw imputed values.

data_modified<-mutate(data, age=NULL)#Removing the age column with large number of NA observations.
data_modified[["id"]]<-c(1:length(data$id))
data_modified<-data_modified %>% rename(Index = id)
str(data_modified)
## 'data.frame':    569 obs. of  32 variables:
##  $ Index                  : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ diagnosis              : chr  "M" "M" "M" "M" ...
##  $ radius_mean            : num  18 20.6 19.7 11.4 20.3 ...
##  $ texture_mean           : num  10.4 17.8 21.2 20.4 14.3 ...
##  $ perimeter_mean         : num  122.8 132.9 130 77.6 135.1 ...
##  $ area_mean              : num  1001 1326 1203 386 1297 ...
##  $ smoothness_mean        : num  0.1184 0.0847 0.1096 0.1425 0.1003 ...
##  $ compactness_mean       : num  0.2776 0.0786 0.1599 0.2839 0.1328 ...
##  $ concavity_mean         : num  0.3001 0.0869 0.1974 0.2414 0.198 ...
##  $ concave.points_mean    : num  0.1471 0.0702 0.1279 0.1052 0.1043 ...
##  $ symmetry_mean          : num  0.242 0.181 0.207 0.26 0.181 ...
##  $ fractal_dimension_mean : num  0.0787 0.0567 0.06 0.0974 0.0588 ...
##  $ radius_se              : num  1.095 0.543 0.746 0.496 0.757 ...
##  $ texture_se             : num  0.905 0.734 0.787 1.156 0.781 ...
##  $ perimeter_se           : num  8.59 3.4 4.58 3.44 5.44 ...
##  $ area_se                : num  153.4 74.1 94 27.2 94.4 ...
##  $ smoothness_se          : num  0.0064 0.00522 0.00615 0.00911 0.01149 ...
##  $ compactness_se         : num  0.049 0.0131 0.0401 0.0746 0.0246 ...
##  $ concavity_se           : num  0.0537 0.0186 0.0383 0.0566 0.0569 ...
##  $ concave.points_se      : num  0.0159 0.0134 0.0206 0.0187 0.0188 ...
##  $ symmetry_se            : num  0.03 0.0139 0.0225 0.0596 0.0176 ...
##  $ fractal_dimension_se   : num  0.00619 0.00353 0.00457 0.00921 0.00511 ...
##  $ radius_worst           : num  25.4 25 23.6 14.9 22.5 ...
##  $ texture_worst          : num  17.3 23.4 25.5 26.5 16.7 ...
##  $ perimeter_worst        : num  184.6 158.8 152.5 98.9 152.2 ...
##  $ area_worst             : num  2019 1956 1709 568 1575 ...
##  $ smoothness_worst       : num  0.162 0.124 0.144 0.21 0.137 ...
##  $ compactness_worst      : num  0.666 0.187 0.424 0.866 0.205 ...
##  $ concavity_worst        : num  0.712 0.242 0.45 0.687 0.4 ...
##  $ concave.points_worst   : num  0.265 0.186 0.243 0.258 0.163 ...
##  $ symmetry_worst         : num  0.46 0.275 0.361 0.664 0.236 ...
##  $ fractal_dimension_worst: num  0.1189 0.089 0.0876 0.173 0.0768 ...
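Note that the code above replaces the patient IDs with a 1..n Index column rather than using the IDs themselves as row names. A sketch of the literal approach the question describes, using tibble::column_to_rownames() (shown on a toy frame; `toy` stands in for `data` after dropping the age column):

```r
library(tibble)
# Keep the original patient IDs as row names instead of a 1..n index.
toy <- data.frame(id = c(842302, 842517), diagnosis = c("M", "M"),
                  radius_mean = c(17.99, 20.57))
toy_indexed <- column_to_rownames(toy, var = "id")
rownames(toy_indexed)  # the IDs, now stored as row names
```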
  4. After the previous step there are still 5 missing values from a column. Impute these missing values by the method of your choice. Keep in mind that imputation must be performed separately for the two different diagnoses (outcome variable) groups. Briefly explain why we should impute according to the different outcome groups.

Answer 4. First, convert the diagnosis column to a factor so it can be used for grouping. I then apply group-wise (benign/malignant) mean imputation, a type of statistical imputation, to the missing values in the texture_mean column. We should impute separately within each diagnosis group because the variable is expected to have a different distribution for each diagnosis category (benign vs. malignant), so pooling the groups could bias the imputed values.

data_modified$diagnosis<-as.factor(data_modified$diagnosis)
#Applying mean imputation by groups
categories<-unique(data_modified$diagnosis)
impute_fun<-function(df, x){
  a1<-df %>%filter(diagnosis == x) 
  a1$texture_mean[which(is.na(a1$texture_mean))]<-mean(a1$texture_mean, na.rm= T)
  return(a1)
}
a4<-lapply(categories, function(x)impute_fun(data_modified, x))
row_count<-sapply(a4, nrow)
data_final<-do.call(rbind,a4)
data_final<-arrange(data_final, Index)
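The lapply/rbind pipeline above can also be written as a single grouped mutate, which avoids splitting and re-binding the data frame; a sketch on toy data (on the homework data, drop the toy definition and pipe data_modified in instead):

```r
library(dplyr)
toy <- data.frame(diagnosis = factor(c("B", "B", "M", "M")),
                  texture_mean = c(10, NA, 20, 22))
toy_imputed <- toy %>%
  group_by(diagnosis) %>%                      # impute within each diagnosis group
  mutate(texture_mean = ifelse(is.na(texture_mean),
                               mean(texture_mean, na.rm = TRUE),
                               texture_mean)) %>%
  ungroup()
toy_imputed$texture_mean  # the B-group NA is replaced by the B-group mean
```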
  5. After imputation, use "ggplot" and "facet_wrap" to plot a grid of 10 x 3 histograms to explore the data shape and distribution of all the independent variables in this dataset. The dataset has 10 sets of independent variables not counting age, and each set consists of the mean, standard error and worst value of a particular cell measurement. "Area" and "concavity" are cell measurements. For example, "area_se" is the standard error of area measurements from a particular patient in this study. When you plot, remember to select a reasonable number of bins and add legends and labels when appropriate. Adjust the size of the plot display so that you can see all the facets clearly when you knit.
ggplot(gather(data_final[, 3:32]), aes(x=value))+geom_histogram(bins = 20)+labs(x = "Values", y ="Frequency", title="Histograms")+theme(plot.title = element_text(hjust = 0.5))+facet_wrap(~key, scale = 'free_x', ncol=3)

  6. If you observe the independent variable distributions closely, groups of variables that start with "area" and "concavity" are consistently strongly skewed to the right. Apply the log transform using the same formula we used in the last HW to these 6 variables.
df_final<-mutate_at(data_final, vars(contains("area")|contains("concavity")), function(x)log(x+1))

Gather the transformed dataset and replot the histograms:

ggplot(gather(df_final[, 3:32]), aes(value))+geom_histogram(bins = 20)+labs(x = "Values", y="Frequency", title="Histograms")+theme(plot.title = element_text(hjust = 0.5))+facet_wrap(~key, scale = 'free_x', ncol=3)

PCA

  7. The pre-processed dataset needs to be scaled before performing PCA. Can you give a brief explanation on why that is the case? Standardize the dataset. Use summary() again to show that your dataset has been properly standardized.

Answer 7. The dataset needs to be scaled before performing PCA because PCA projects the data onto mutually orthogonal directions chosen to maximize variance. Without scaling, variables with large units or magnitudes (e.g., area_mean in the thousands vs. smoothness_mean near 0.1) would dominate the leading components purely because of their scale. Standardizing gives every variable mean 0 and standard deviation 1, so each contributes to the components on an equal footing.

if (!("robustHD" %in% rownames(installed.packages())))
  install.packages("robustHD")
library("robustHD")
## Loading required package: perry
## Loading required package: parallel
## Loading required package: robustbase
for(i in c(3:32))
df_final[,i]<-standardize(df_final[, i], centerFun = mean, scaleFun = sd)
summary(df_final)
##      Index     diagnosis  radius_mean       texture_mean     perimeter_mean   
##  Min.   :  1   B:357     Min.   :-2.0279   Min.   :-2.2388   Min.   :-1.9828  
##  1st Qu.:143   M:212     1st Qu.:-0.6888   1st Qu.:-0.7224   1st Qu.:-0.6913  
##  Median :285             Median :-0.2149   Median :-0.1019   Median :-0.2358  
##  Mean   :285             Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.:427             3rd Qu.: 0.4690   3rd Qu.: 0.5770   3rd Qu.: 0.4992  
##  Max.   :569             Max.   : 3.9678   Max.   : 4.6596   Max.   : 3.9726  
##    area_mean       smoothness_mean    compactness_mean  concavity_mean   
##  Min.   :-2.8860   Min.   :-3.10935   Min.   :-1.6087   Min.   :-1.1774  
##  1st Qu.:-0.6672   1st Qu.:-0.71034   1st Qu.:-0.7464   1st Qu.:-0.7619  
##  Median :-0.1065   Median :-0.03486   Median :-0.2217   Median :-0.3256  
##  Mean   : 0.0000   Mean   : 0.00000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.6198   3rd Qu.: 0.63564   3rd Qu.: 0.4934   3rd Qu.: 0.5746  
##  Max.   : 3.0268   Max.   : 4.76672   Max.   : 4.5644   Max.   : 3.8920  
##  concave.points_mean symmetry_mean      fractal_dimension_mean
##  Min.   :-1.2607     Min.   :-2.74171   Min.   :-1.8183       
##  1st Qu.:-0.7373     1st Qu.:-0.70262   1st Qu.:-0.7220       
##  Median :-0.3974     Median :-0.07156   Median :-0.1781       
##  Mean   : 0.0000     Mean   : 0.00000   Mean   : 0.0000       
##  3rd Qu.: 0.6464     3rd Qu.: 0.53031   3rd Qu.: 0.4706       
##  Max.   : 3.9245     Max.   : 4.48081   Max.   : 4.9066       
##    radius_se         texture_se       perimeter_se        area_se       
##  Min.   :-1.0590   Min.   :-1.5529   Min.   :-1.0431   Min.   :-1.9362  
##  1st Qu.:-0.6230   1st Qu.:-0.6942   1st Qu.:-0.6232   1st Qu.:-0.6865  
##  Median :-0.2920   Median :-0.1973   Median :-0.2864   Median :-0.2568  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.2659   3rd Qu.: 0.4661   3rd Qu.: 0.2428   3rd Qu.: 0.5832  
##  Max.   : 8.8991   Max.   : 6.6494   Max.   : 9.4537   Max.   : 4.0749  
##  smoothness_se     compactness_se     concavity_se     concave.points_se
##  Min.   :-1.7745   Min.   :-1.2970   Min.   :-1.1284   Min.   :-1.9118  
##  1st Qu.:-0.6235   1st Qu.:-0.6923   1st Qu.:-0.5833   1st Qu.:-0.6739  
##  Median :-0.2201   Median :-0.2808   Median :-0.1981   Median :-0.1404  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.3680   3rd Qu.: 0.3893   3rd Qu.: 0.3708   3rd Qu.: 0.4722  
##  Max.   : 8.0229   Max.   : 6.1381   Max.   :11.0139   Max.   : 6.6438  
##   symmetry_se      fractal_dimension_se  radius_worst     texture_worst     
##  Min.   :-1.5315   Min.   :-1.0960      Min.   :-1.7254   Min.   :-2.22204  
##  1st Qu.:-0.6511   1st Qu.:-0.5846      1st Qu.:-0.6743   1st Qu.:-0.74797  
##  Median :-0.2192   Median :-0.2297      Median :-0.2688   Median :-0.04348  
##  Mean   : 0.0000   Mean   : 0.0000      Mean   : 0.0000   Mean   : 0.00000  
##  3rd Qu.: 0.3554   3rd Qu.: 0.2884      3rd Qu.: 0.5216   3rd Qu.: 0.65776  
##  Max.   : 7.0657   Max.   : 9.8429      Max.   : 4.0906   Max.   : 3.88249  
##  perimeter_worst     area_worst      smoothness_worst  compactness_worst
##  Min.   :-1.6919   Min.   :-2.5092   Min.   :-2.6803   Min.   :-1.4426  
##  1st Qu.:-0.6890   1st Qu.:-0.6689   1st Qu.:-0.6906   1st Qu.:-0.6805  
##  Median :-0.2857   Median :-0.1521   Median :-0.0468   Median :-0.2693  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.5398   3rd Qu.: 0.6712   3rd Qu.: 0.5970   3rd Qu.: 0.5392  
##  Max.   : 4.2836   Max.   : 3.1371   Max.   : 3.9519   Max.   : 5.1084  
##  concavity_worst   concave.points_worst symmetry_worst   
##  Min.   :-1.4782   Min.   :-1.7435      Min.   :-2.1591  
##  1st Qu.:-0.7766   1st Qu.:-0.7557      1st Qu.:-0.6413  
##  Median :-0.1557   Median :-0.2233      Median :-0.1273  
##  Mean   : 0.0000   Mean   : 0.0000      Mean   : 0.0000  
##  3rd Qu.: 0.6201   3rd Qu.: 0.7119      3rd Qu.: 0.4497  
##  Max.   : 3.7763   Max.   : 2.6835      Max.   : 6.0407  
##  fractal_dimension_worst
##  Min.   :-1.6004        
##  1st Qu.:-0.6913        
##  Median :-0.2163        
##  Mean   : 0.0000        
##  3rd Qu.: 0.4504        
##  Max.   : 6.8408

We can see that the mean of every variable above is zero (and, by construction of the z-score, the standard deviation is one), confirming that standardization has been applied properly.
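Base R's scale() gives the same result as robustHD::standardize() with the default mean/sd functions, without an extra package; a minimal sketch on toy data:

```r
# scale() centers each column to mean 0 and divides by its standard deviation.
toy <- data.frame(a = c(1, 2, 3), b = c(10, 20, 60))
toy_scaled <- as.data.frame(scale(toy))
colMeans(toy_scaled)      # approximately 0 for every column
apply(toy_scaled, 2, sd)  # exactly 1 for every column
```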

  8. Calculate principal components using the function princomp() and print the summary of the results.

Answer 8. Applying principal component analysis to the quantitative variables, i.e., columns 3 to 32:
result_pca <- princomp(df_final[, c(3:32)], cor = TRUE)  # prcomp(..., scale. = TRUE) is an equivalent alternative
summary(result_pca)
## Importance of components:
##                           Comp.1    Comp.2     Comp.3     Comp.4    Comp.5
## Standard deviation     3.6635805 2.3977788 1.66888098 1.39743453 1.2826898
## Proportion of Variance 0.4473941 0.1916448 0.09283879 0.06509411 0.0548431
## Cumulative Proportion  0.4473941 0.6390388 0.73187763 0.79697174 0.8518148
##                            Comp.6     Comp.7     Comp.8     Comp.9    Comp.10
## Standard deviation     1.09719086 0.78224220 0.68277447 0.62792155 0.58639504
## Proportion of Variance 0.04012759 0.02039676 0.01553937 0.01314285 0.01146197
## Cumulative Proportion  0.89194243 0.91233919 0.92787856 0.94102141 0.95248338
##                            Comp.11     Comp.12    Comp.13     Comp.14
## Standard deviation     0.528068541 0.503843846 0.48160262 0.401349921
## Proportion of Variance 0.009295213 0.008461954 0.00773137 0.005369392
## Cumulative Proportion  0.961778593 0.970240547 0.97797192 0.983341309
##                           Comp.15     Comp.16     Comp.17     Comp.18
## Standard deviation     0.29833386 0.287310199 0.270933816 0.235029902
## Proportion of Variance 0.00296677 0.002751572 0.002446838 0.001841302
## Cumulative Proportion  0.98630808 0.989059650 0.991506488 0.993347790
##                           Comp.19     Comp.20     Comp.21      Comp.22
## Standard deviation     0.19382798 0.181997710 0.175691108 0.1628127344
## Proportion of Variance 0.00125231 0.001104106 0.001028912 0.0008835995
## Cumulative Proportion  0.99460010 0.995704205 0.996733117 0.9976167167
##                             Comp.23      Comp.24      Comp.25      Comp.26
## Standard deviation     0.1474461944 0.1255421684 0.1145291762 0.1070840208
## Proportion of Variance 0.0007246793 0.0005253612 0.0004372311 0.0003822329
## Cumulative Proportion  0.9983413960 0.9988667572 0.9993039883 0.9996862212
##                             Comp.27      Comp.28      Comp.29      Comp.30
## Standard deviation     0.0836331425 4.086534e-02 2.476666e-02 1.164032e-02
## Proportion of Variance 0.0002331501 5.566586e-05 2.044624e-05 4.516570e-06
## Cumulative Proportion  0.9999193713 9.999750e-01 9.999955e-01 1.000000e+00
  9. Plot a scree plot using the screeplot() function.
screeplot(result_pca)

  10. Plot the following two plots and use patchwork/gridExtra to position the two plots side by side:

  a. proportion of variance explained over the number of principal components

car_var<-paste("pc", c(1:30), sep = '')
df_pca<-data.frame(PC_comp = car_var, sdev = result_pca$sdev)
df_pca<-mutate(df_pca,var = (sdev)^2)
a = sum(df_pca$var)
df_pca<-mutate(df_pca,var_prop = var/a)
  b. cumulative proportion of variance explained over the number of principal components; draw horizontal lines at 88% and 95% of variance.
df_pca$cum_var <- cumsum(df_pca$var_prop)
p1 <- df_pca[c(1:20), ] %>%
  ggplot(aes(x = reorder(PC_comp, -var_prop), y = var_prop)) + geom_col() + labs(x = "Principal Components", y = "Proportion of Variance", title = "Proportion of Variance (First 20 PCs)") + theme(plot.title = element_text(hjust = 0.5))
p2 <- df_pca[c(1:20), ] %>%
  ggplot(aes(x = 1:20, y = cum_var)) + geom_point() + geom_line() + scale_x_continuous(breaks = 1:20, labels = paste0("pc", 1:20)) + geom_hline(yintercept = 0.88, linetype = "dashed", color = "red") + geom_hline(yintercept = 0.95, linetype = "dashed", color = "green") + labs(x = "Principal Components", y = "Cumulative Variance", title = "Cumulative Variance (First 20 PCs)") + theme(plot.title = element_text(hjust = 0.5))
grid.arrange(p1, p2, ncol = 2, top = "Variance")

Note: please remember to clearly label your plots with titles, axis labels and legends when appropriate.

  11. What proportions of variance are captured from the first, second and third principal components? How many principal components do you need to describe at least 88% and 95% of the variance, respectively?
df_pca$var_prop[c(1:3)]*100
## [1] 44.739406 19.164478  9.283879

Answer 11. The first, second and third principal components capture 44.7%, 19.2% and 9.3% of the variance, respectively. Six principal components are needed to capture at least 88% of the variance, and ten are needed to capture at least 95%.
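Rather than reading the component counts off the cumulative plot, they can be computed directly with which() on the cumulative proportions; sketch on hypothetical proportions (on the homework data, use df_pca$var_prop instead of the var_prop vector below):

```r
# Smallest number of components whose cumulative variance reaches each threshold.
var_prop <- c(0.45, 0.19, 0.09, 0.07, 0.06, 0.04, 0.03, 0.03, 0.02, 0.02)
cum_var <- cumsum(var_prop)
n_88 <- which(cum_var >= 0.88)[1]
n_95 <- which(cum_var >= 0.95)[1]
c(n_88 = n_88, n_95 = n_95)
```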

  12. Which are the top 3 variables that contribute the most to the variance captured from PC1, PC2 and PC3 respectively? (hint: look at the loadings information)
pc_final_1<-data.frame(load = result_pca$loadings[,1])%>%arrange(desc(load))
head(pc_final_1)
##                           load
## concave.points_mean  0.2597346
## concavity_mean       0.2574969
## concave.points_worst 0.2521591
## compactness_mean     0.2380904
## perimeter_worst      0.2367356
## concavity_worst      0.2323810
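One caveat: the sort above ranks by signed loading, so a variable with a large negative loading would be missed; contribution to a component is better ranked by absolute loading, and the same ranking must be repeated for each component. A hypothetical helper, demonstrated on the built-in mtcars data (apply it to result_pca with comp = 1, 2, 3 for the homework answer):

```r
# Top-n variables by absolute loading for one principal component.
top_loadings <- function(pca, comp, n = 3) {
  l <- pca$loadings[, comp]
  head(l[order(abs(l), decreasing = TRUE)], n)
}
pca_toy <- princomp(mtcars, cor = TRUE)  # toy PCA on a built-in dataset
top_loadings(pca_toy, 1)                 # e.g. top_loadings(result_pca, 2) for PC2
```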

Answer 12. Based on the loadings shown above, the three variables that contribute the most to PC1 are concave.points_mean, concavity_mean and concave.points_worst. Applying the same ranking to result_pca$loadings[, 2] and result_pca$loadings[, 3] gives the top three contributors for PC2 and PC3, respectively.

  14. Because of the relatively large number of variables in this dataset, it's very difficult to see the biplot clearly. Use the autoplot() function in package "ggfortify" to display a clearer biplot overlaid with a scatter plot for the first 2 principal components.

biplot(result_pca, choices = 1:2)

autoplot(result_pca, data = df_final, x = 1, y = 2, geom = 'point', colour = 'diagnosis') + labs(title = "PCA biplot: PC1 vs PC2") + theme(plot.title = element_text(hjust = 0.5))
## Warning: `select_()` is deprecated as of dplyr 0.7.0.
## Please use `select()` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_warnings()` to see where this warning was generated.

  15. Plot a grid of 3 x 2 scatter plots, where each plot is a scatter plot between two of the first 4 principal components, with different colors for each diagnosis group. For example, in grid cell (1,1), you should plot a scatter plot where the x-axis is PC1 and y-axis is PC2 where observations are colored according to the diagnosis group. Remember to adjust the plot display size so that you can see it clearly. Add legends and labels when appropriate. What are your observations?

pca12 <- autoplot(result_pca, data = df_final, x = 1, y = 2, geom = 'point', colour = 'diagnosis') + labs(title = "PC1 vs PC2") + theme(plot.title = element_text(hjust = 0.5))
pca13 <- autoplot(result_pca, data = df_final, x = 1, y = 3, geom = 'point', colour = 'diagnosis') + labs(title = "PC1 vs PC3") + theme(plot.title = element_text(hjust = 0.5))
pca14 <- autoplot(result_pca, data = df_final, x = 1, y = 4, geom = 'point', colour = 'diagnosis') + labs(title = "PC1 vs PC4") + theme(plot.title = element_text(hjust = 0.5))
pca23 <- autoplot(result_pca, data = df_final, x = 2, y = 3, geom = 'point', colour = 'diagnosis') + labs(title = "PC2 vs PC3") + theme(plot.title = element_text(hjust = 0.5))
pca24 <- autoplot(result_pca, data = df_final, x = 2, y = 4, geom = 'point', colour = 'diagnosis') + labs(title = "PC2 vs PC4") + theme(plot.title = element_text(hjust = 0.5))
pca34 <- autoplot(result_pca, data = df_final, x = 3, y = 4, geom = 'point', colour = 'diagnosis') + labs(title = "PC3 vs PC4") + theme(plot.title = element_text(hjust = 0.5))
grid.arrange(pca12, pca13, pca14, pca23, pca24, pca34, ncol = 2, nrow = 3, top = "Pairwise scatter plots of the first four principal components")

Answer 15. In the plots that do not involve PC1, the two diagnosis groups overlap heavily and do not separate well. This is expected, because PC1 alone captures about 44% of the variance and is the component needed to distinguish the two diagnosis groups. In the plots that do involve PC1, most of the separation occurs along PC1, with the two groups dividing roughly at PC1 = 0 (i.e., x = 0).

Hierarchical Clustering

  17. Calculate a dissimilarity matrix using Euclidean distance, compute hierarchical clustering using the complete linkage method, and plot the dendrogram. Use the rect.hclust() function to display the dendrogram divided into 4 branches.
library("cluster")
df.dissimilarity<-daisy(df_final[, c(3:32)], metric ="euclidean")
df.cluster.complete<-hclust(df.dissimilarity, method = "complete")
plot(df.cluster.complete)

  18. Divide the dendrogram into 4 clusters using the cutree() function. Then use the table() function and the diagnosis label information to compare the composition (benign vs. malignant) of each of the 4 clusters. How would you label each of these four clusters (e.g., cluster 1 is benign or malignant, cluster 2 is …, etc.)?
df.cluster.4 <- cutree(df.cluster.complete, k = 4)  # k determines the cut, so no height h is needed
table(df.cluster.4)
## df.cluster.4
##   1   2   3   4 
##  16 187 364   2
df.cluster.4.diagnosis<-data.frame(Diagnosis = df_final$diagnosis, Cluster = df.cluster.4)
table(df.cluster.4.diagnosis)
##          Cluster
## Diagnosis   1   2   3   4
##         B   0  29 326   2
##         M  16 158  38   0

Answer 18. To label each cluster, assign it the diagnosis that makes up the majority of its members. By that rule, clusters 1 and 2 are malignant, and clusters 3 and 4 are benign. This misclassifies 29 of the 357 benign tumors (8.1%, the benign cases in cluster 2) and 38 of the 212 malignant tumors (17.9%, the malignant cases in cluster 3), for a total of 67/569 (11.8%) misclassified.

  19. Now try 6 clusters and plot dendrograms for hierarchical clustering using Ward's linkage. Use the table() function to view the clustering result. How would you label each of these 6 clusters? Does this clustering work better or worse than the clustering result in the previous question? Give brief explanations.

df.cluster.ward<-hclust(df.dissimilarity, method = "ward.D")
df.cluster.ward.6 <-cutree(df.cluster.ward,k = 6)
table(df.cluster.ward.6)
## df.cluster.ward.6
##   1   2   3   4   5   6 
## 104  58  94 155 140  18
plot(df.cluster.ward)

df.cluster.6.diagnosis<-data.frame(Diagnosis = df_final$diagnosis, Cluster = df.cluster.ward.6)
table(df.cluster.6.diagnosis)
##          Cluster
## Diagnosis   1   2   3   4   5   6
##         B   0   5  53 146 136  17
##         M 104  53  41   9   4   1

Answer 19. Labeling each cluster by its majority class: clusters 1 and 2 are malignant, and clusters 3, 4, 5 and 6 are benign (cluster 3 only narrowly, with 53 benign vs. 41 malignant). This misclassifies 5 of the 357 benign tumors (1.4%) and 55 of the 212 malignant tumors (25.9%), for a total of 60/569 (10.5%). The overall misclassification rate is slightly lower than with 4 clusters, but this clustering has a drawback: cluster 3 contains roughly equal numbers of benign and malignant cases, so it cannot be confidently assigned to either class.
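The misclassification percentages quoted in these answers follow one rule: label each cluster by its majority class and count everything else as misclassified. That rule can be wrapped in a small hypothetical helper (shown with the 4-cluster table's counts hard-coded; the live cluster-by-diagnosis tables work the same way):

```r
# Misclassification rate: per cluster, everything outside the majority class.
misclass_rate <- function(tab) {
  # tab: contingency table/matrix with rows = diagnosis, columns = cluster
  (sum(tab) - sum(apply(tab, 2, max))) / sum(tab)
}
tab4 <- matrix(c(0, 16, 29, 158, 326, 38, 2, 0), nrow = 2,
               dimnames = list(Diagnosis = c("B", "M"), Cluster = 1:4))
round(misclass_rate(tab4), 3)  # 67 misclassified out of 569
```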

K-Means Clustering

  20. Compute k-means clustering on this dataset using the kmeans() function for two clusters. Then use the table() function and the diagnosis label information to compare the composition (benign vs. malignant) of each of the 2 clusters (hint: the cluster information from k-means is stored in the $cluster attribute of the k-means result).
kc<-kmeans(df_final[, 3:32], 2)
df_kmeans_plot<-data.frame(Diagnosis = df_final$diagnosis, Cluster = kc$cluster)
table(df_kmeans_plot)
##          Cluster
## Diagnosis   1   2
##         B 346  11
##         M  33 179
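One caution with kmeans(): it starts from random centers, so the cluster labels (and occasionally the partition itself) can change between runs. Fixing the seed and using multiple random starts makes the result reproducible and more stable; a sketch on the built-in mtcars data (for the homework, replace scale(mtcars) with df_final[, 3:32]; the seed value is arbitrary):

```r
set.seed(42)  # arbitrary seed, only for reproducibility
kc_stable <- kmeans(scale(mtcars), centers = 2, nstart = 25)  # best of 25 random starts
table(kc_stable$cluster)  # cluster sizes are now reproducible across runs
```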
  21. Visualize the clusters using the fviz_cluster() function from the factoextra package.
library("factoextra")
fviz_cluster(kc, data = df_final[, 3:32])

  22. Between the 3 clustering models we have tried in this homework, which one does the best? What's your observation?

For the PCA-based classification (labeling each observation by the sign of its PC1 score), the results are:

df_pca_comp <- data.frame(PCA_Cluster = as.numeric(pca12$data$Comp.1 < 0), Diagnosis = df_final$diagnosis)
table(df_pca_comp$Diagnosis, df_pca_comp$PCA_Cluster)
##    
##       0   1
##   B  29 328
##   M 192  20

Misclassified: B: 29/357 (8.1%); M: 20/212 (9.4%); total: 49/569 (8.6%).

For hierarchical clustering, the results are:

df.cluster.4.diagnosis<-data.frame(Diagnosis = df_final$diagnosis, Cluster = df.cluster.4)
table(df.cluster.4.diagnosis)
##          Cluster
## Diagnosis   1   2   3   4
##         B   0  29 326   2
##         M  16 158  38   0

Misclassified (4 clusters): B: 29/357 (8.1%); M: 38/212 (17.9%); total: 67/569 (11.8%). Misclassified (6 clusters): B: 5/357 (1.4%); M: 55/212 (25.9%); total: 60/569 (10.5%).

For k-means, the results are:

table(df_kmeans_plot)
##          Cluster
## Diagnosis   1   2
##         B 346  11
##         M  33 179

Misclassified: B: 11/357 (3.1%); M: 33/212 (15.6%); total: 44/569 (7.7%).

Based on the misclassification rates, k-means clustering performs best, followed by the PCA-based classification, followed by hierarchical clustering.